Semantic  Role  Labeling  via  Integer  Linear  Programming  Inference 


Vasin  Punyakanok  Dan  Roth  Wen-tau  Yih  Dav  Zimak 

Department  of  Computer  Science 
University  of  Illinois  at  Urbana-Champaign 

{punyakan, danr, yih, davzimak}@uiuc.edu


Abstract 

We present a system for the semantic role labeling task. The system
combines a machine learning technique with an inference procedure
based on integer linear programming that supports the incorporation
of linguistic and structural constraints into the decision process.
The system is tested on the data provided in the CoNLL-2004 shared
task on semantic role labeling and achieves very competitive results.

1  Introduction 

Semantic parsing of sentences is believed to be an important task
toward natural language understanding, and has immediate applications
in tasks such as information extraction and question answering. We
study semantic role labeling (SRL). For each verb in a sentence, the
goal is to identify all constituents that fill a semantic role, and to
determine their roles, such as Agent, Patient or Instrument, and their
adjuncts, such as Locative, Temporal or Manner.

The PropBank project (Kingsbury and Palmer, 2002) provides a large
human-annotated corpus of semantic verb-argument relations.
Specifically, we use the data provided in the CoNLL-2004 shared task
of semantic-role labeling (Carreras and Màrquez, 2003), which consists
of a portion of the PropBank corpus, allowing us to compare the
performance of our approach with other systems.

Previous approaches to the SRL task have made use of a full syntactic
parse of the sentence in order to define argument boundaries and to
determine the role labels (Gildea and Palmer, 2002; Chen and Rambow,
2003; Gildea and Hockenmaier, 2003; Pradhan et al., 2003; Pradhan et
al., 2004; Surdeanu et al., 2003). In this work, following the
CoNLL-2004 shared task definition, we assume that the SRL system takes
as input only partial syntactic information, and no external
lexico-semantic knowledge bases. Specifically, we assume as input
resources a part-of-speech tagger, a shallow parser that can process
the input to the level of base chunks and clauses (Tjong Kim Sang and
Buchholz, 2000; Tjong Kim Sang and Déjean, 2001), and a named-entity
recognizer (Tjong Kim Sang and De Meulder, 2003). We do not assume a
full parse as input.

SRL is a difficult task, and one cannot expect high levels of
performance from either purely manual classifiers or purely learned
classifiers. Rather, supplemental linguistic information must be used
to support and correct a learning system. So far, machine learning
approaches to SRL have incorporated linguistic information only
implicitly, via the classifiers' features. The key innovation in our
approach is the development of a principled method to combine machine
learning techniques with linguistic and structural constraints by
explicitly incorporating inference into the decision process.

In the machine learning part, the system we present here is composed
of two phases. First, a set of argument candidates is produced using
two learned classifiers: one to discover beginning positions and one
to discover end positions of each argument type. Hopefully, this phase
discovers a small superset of all arguments in the sentence (for each
verb). In a second learning phase, the candidate arguments from the
first phase are re-scored using a classifier designed to determine
argument type, given a candidate argument.

Unfortunately, it is difficult to incorporate global properties of the
sentence into the learning phases. However, at the inference level it
is possible to incorporate the fact that the set of possible
role-labelings is restricted by both structural and linguistic
constraints: for example, arguments cannot structurally overlap, or,
given a predicate, some argument structures are illegal. The overall
decision problem must produce an outcome that is consistent with these
constraints. We encode the constraints as linear inequalities, and use
integer linear programming (ILP) as an inference procedure to make a
final decision that is both consistent with the constraints and most
likely according to the learning system. Although ILP is generally a
computationally hard problem, there are efficient implementations that
can run on thousands of variables and constraints. In our experiments,
we used the commercial ILP package (Xpress-MP, 2003), and were able to
process roughly twenty sentences per second.

2  Task  Description 

The goal of the semantic-role labeling task is to discover the
verb-argument structure for a given input sentence. For example, given
a sentence “I left my pearls to my daughter-in-law in my will”, the
goal is to identify different arguments of the verb left, which yields
the output:

[A0 I] [V left] [A1 my pearls] [A2 to my daughter-in-law]
[AM-LOC in my will].

Here A0 represents the leaver, A1 represents the thing left, A2
represents the benefactor, AM-LOC is an adjunct indicating the
location of the action, and V determines the verb.

Following the definition of the PropBank and the CoNLL-2004 shared
task, there are six different types of arguments labelled as A0-A5 and
AA. These labels have different semantics for each verb as specified
in the PropBank Frame files. In addition, there are also 13 types of
adjuncts labelled as AM-XXX, where XXX specifies the adjunct type. In
some cases, an argument may span over different parts of a sentence;
the label C-XXX is used to specify the continuity of the arguments, as
shown in the example below.

[A1 The pearls] , [A0 I] [V said] , [C-A1 were left to my
daughter-in-law].

Moreover, in some cases, an argument might be a relative pronoun that
in fact refers to the actual agent outside the clause. In this case,
the actual agent is labeled as the appropriate argument type, XXX,
while the relative pronoun is instead labeled as R-XXX. For example,

[A1 The pearls] [R-A1 which] [A0 I] [V left] , [A2 to my
daughter-in-law] are fake.

See the details of the definition in Kingsbury and Palmer (2002) and
Carreras and Màrquez (2003).

3  System  Architecture 

Our semantic role labeling system consists of two phases. The first
phase finds a subset of arguments from all possible candidates. The
goal here is to filter out as many false argument candidates as
possible, while still maintaining high recall. The second phase
focuses on identifying the types of those argument candidates. Since
the number of candidates is much smaller, the second phase is able to
use slightly complicated features to facilitate learning a better
classifier. This section first introduces the learning system we use
and then describes how we learn the classifiers in these two phases.

3.1  SNoW  Learning  Architecture 

The learning algorithm used is a variation of the Winnow update rule
incorporated in SNoW (Roth, 1998; Roth and Yih, 2002), a multi-class
classifier that is specifically tailored for large scale learning
tasks. SNoW learns a sparse network of linear functions, in which the
targets (argument border predictions or argument type predictions, in
this case) are represented as linear functions over a common feature
space. It incorporates several improvements over the basic Winnow
multiplicative update rule. In particular, a regularization term is
added, which has the effect of trying to separate the data with a
thick separator (Grove and Roth, 2001; Hang et al., 2002). In the work
presented here we use this regularization with a fixed parameter.

Experimental evidence has shown that SNoW activations are monotonic
with the confidence in the prediction. Therefore, it can provide a
good source of probability estimation. We use softmax (Bishop, 1995)
over the raw activation values as conditional probabilities, and also
the score of the target. Specifically, suppose the number of classes
is $n$, and the raw activation value of class $i$ is $act_i$. The
posterior estimation for class $i$ is derived by the following
equation.

$$\mathrm{score}(i) = p_i = \frac{e^{act_i}}{\sum_{1 \le j \le n} e^{act_j}}$$

The score plays an important role in different places. For example,
the first phase uses the scores to decide which argument candidates
should be filtered out. Also, the scores output by the second-phase
classifier are used in the inference procedure to reason for the best
global labeling.
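As a rough illustration of the score computation described above, the following Python sketch applies softmax to a list of raw activations. The function name `softmax_scores` and the sample activations are assumptions made for this example and are not part of SNoW. Subtracting the maximum activation is a common numerical-stability step that leaves the resulting scores unchanged.

```python
import math

def softmax_scores(activations):
    """Map raw activations to positive scores that sum to 1."""
    # Shifting by the maximum avoids overflow in exp() and keeps the ratios intact.
    top = max(activations)
    exps = [math.exp(a - top) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical activations for three classes (illustrative values only).
scores = softmax_scores([2.0, 1.0, 0.1])
```

Because softmax is monotonic in the activations, the ranking of classes is preserved while the outputs can be compared and summed like probabilities.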

3.2  First  Phase:  Find  Argument  Candidates 

The  first  phase  is  to  predict  the  argument  candidates 
of  a  given  sentence  that  correspond  to  the  active 
verb.  Unfortunately,  it  turns  out  that  it  is  difficult  to 
predict  the  exact  arguments  accurately.  Therefore, 
the  goal  here  is  to  output  a  superset  of  the  correct 
arguments  by  filtering  out  unlikely  candidates. 

Specifically, we learn two classifiers, one to detect beginning
argument locations and the other to detect end argument locations.
Each multi-class classifier makes predictions over forty-three
classes: thirty-two argument types, ten continuous argument types, and
one class to detect not beginning/not end. Features used for these
classifiers are:

•  Word feature includes the current word, two words before and two
words after.

•  Part-of-speech tag (POS) feature includes the POS tags of all words
in a window of size two.

•  Chunk feature includes the BIO tags for chunks of all words in a
window of size two.

•  Predicate lemma & POS tag show the lemma form and POS tag of the
active predicate.

•  Voice feature is the voice (active/passive) of the current
predicate. This is extracted with a simple rule: a verb is identified
as passive if it follows a to-be verb in the same phrase chunk and its
POS tag is VBN (past participle) or it immediately follows a noun
phrase.

•  Position feature describes if the current word is before or after
the predicate.

•  Chunk pattern encodes the sequence of chunks from the current words
to the predicate.

•  Clause tag indicates the boundary of clauses.

•  Clause path feature is a path formed from a semi-parsed tree
containing only clauses and chunks. Each clause is named with the
chunk preceding it. The clause path is the path from predicate to
target word in the semi-parse tree.

•  Clause position feature is the position of the target word relative
to the predicate in the semi-parse tree containing only clauses. There
are four configurations: target word and predicate share the same
parent, target word parent is an ancestor of predicate, predicate
parent is an ancestor of target word, or otherwise.

Because each argument consists of a single beginning and a single
ending, these classifiers can be used to construct a set of potential
arguments (by combining each predicted begin with each predicted end
after it of the same type).
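The pairing step can be sketched as follows. This is a minimal illustration, not the system's actual code: the function name `build_candidates`, the `(word_index, argument_type)` input format, and the sample predictions are all assumptions. The sketch also assumes that a one-word argument has its begin and end at the same position, so an end is accepted when it does not precede the begin.

```python
def build_candidates(begins, ends):
    """Pair each predicted begin with every predicted end of the same type
    that does not precede it.

    begins, ends: lists of (word_index, argument_type) predictions.
    Returns a sorted list of (start, end, argument_type) spans, end inclusive.
    """
    candidates = set()
    for b_pos, b_type in begins:
        for e_pos, e_type in ends:
            if e_type == b_type and e_pos >= b_pos:
                candidates.add((b_pos, e_pos, b_type))
    return sorted(candidates)

# Hypothetical boundary predictions for a six-word sentence.
begins = [(0, "A0"), (2, "A1"), (3, "A1")]
ends = [(0, "A0"), (1, "A1"), (3, "A1"), (5, "A1")]
spans = build_candidates(begins, ends)
```

Note that the resulting candidates may overlap one another; resolving such conflicts is left to the later inference step.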

Although this phase identifies typed arguments (i.e., labeled with
argument types), the second phase will re-score each phrase using
phrase-based classifiers; therefore, the goal of the first phase is
simply to identify non-typed phrase candidates. In this task, we
achieve 98.96% and 88.65% recall (overall, without verb) on the
training and the development set, respectively. Because these are the
only candidates passed to the second phase, the final system
performance is upper-bounded by 88.65%.

3.3  Second  Phase:  Argument  Classification 

The second phase of our system assigns the final argument classes to
(a subset of) the argument candidates supplied from the first phase.
Again, the SNoW learning architecture is used to train a multi-class
classifier to label each argument to one of the argument types, plus a
special class, no argument (null). Training examples are created from
the argument candidates supplied from the first phase using the
following features:

•  Predicate lemma & POS tag, voice, position, clause path, clause
position, chunk pattern: same features as those in the first phase.

•  Word & POS tag from the argument, including the first, last, and
head¹ word and tag.

•  Named entity feature tells if the target argument is, embeds,
overlaps, or is embedded in a named entity with its type.

•  Chunk tells if the target argument is, embeds, overlaps, or is
embedded in a chunk with its type.

•  Lengths of the target argument, in the numbers of words and chunks
separately.

•  Verb class feature is the class of the active predicate described
in PropBank Frames.

•  Phrase type uses simple heuristics to identify the target argument
as VP, PP, or NP.

•  Sub-categorization describes the phrase structure around the
predicate. We separate the clause where the predicate is in into three
parts (the predicate chunk, and the segments before and after the
predicate) and use the sequence of phrase types of these three
segments.

•  Baseline features identify the word "not" in the main verb chunk as
AM-NEG and a modal verb in the main verb chunk as AM-MOD.

•  Clause coverage describes how much of the local clause (from the
predicate) is covered by the target argument.

•  Chunk pattern length feature counts the number of patterns in the
argument.

•  Conjunctions join every pair of the above features as new features.

•  Boundary words & POS tag include two words/tags before and after
the target argument.

•  Bigrams are pairs of words/tags in the window from two words before
the target to the first word of the target, and also from the last
word to two words after the argument.

•  Sparse collocation picks one word/tag from the two words before the
argument, the first word/tag, the last word/tag of the argument, and
one word/tag from the two words after the argument to join as
features.

¹ We use simple rules to first decide if a candidate phrase type is
VP, NP, or PP. The headword of an NP phrase is the right-most noun.
Similarly, the left-most verb/preposition of a VP/PP phrase is
extracted as the headword.

Although the predictions of the second-phase classifier can be used
directly, the labels of arguments in a sentence often violate some
constraints. Therefore, we rely on the inference procedure to make the
final predictions.

4  Inference  via  ILP 

Ideally, if the learned classifiers are perfect, arguments can be
labeled correctly according to the classifiers' predictions. In
reality, labels assigned to arguments in a sentence often contradict
each other, and violate the constraints arising from the structural
and linguistic information. In order to resolve the conflicts, we
design an inference procedure that takes the confidence scores of each
individual argument given by the second-phase classifier as input, and
outputs the best global assignment that also satisfies the
constraints. In this section we first introduce the constraints and
the inference problem in the semantic role labeling task. Then, we
demonstrate how we apply integer linear programming (ILP) to reason
for the global label assignment.

4.1  Constraints  over  Argument  Labeling 

Formally, the argument classifier attempts to assign labels to a set
of arguments, $S^{1:M}$, indexed from 1 to $M$. Each argument $S^i$
can take any label from a set of argument labels, $\mathcal{P}$, and
the indexed set of arguments can take a set of labels,
$c^{1:M} \in \mathcal{P}^M$. If we assume that the classifier returns
a score, $\mathrm{score}(S^i = c^i)$, corresponding to the likelihood
of seeing label $c^i$ for argument $S^i$, then, given a sentence, the
unaltered inference task is solved by maximizing the overall score of
the arguments,

$$\hat{c}^{1:M} = \operatorname*{argmax}_{c^{1:M} \in \mathcal{P}^M} \mathrm{score}(S^{1:M} = c^{1:M}) = \operatorname*{argmax}_{c^{1:M} \in \mathcal{P}^M} \sum_{i=1}^{M} \mathrm{score}(S^i = c^i). \tag{1}$$

In the presence of global constraints derived from linguistic
information and structural considerations, our system seeks a
legitimate labeling that maximizes the score. Specifically, the
solution space can be viewed as being limited through the use of a
filter function, $\mathcal{F}$, that eliminates many argument
labelings from consideration. It is interesting to contrast this with
previous work that filters individual phrases (see Carreras and
Màrquez, 2003). Here, we are concerned with global constraints as well
as constraints on the arguments. Therefore, the final labeling becomes

$$\hat{c}^{1:M} = \operatorname*{argmax}_{c^{1:M} \in \mathcal{F}(\mathcal{P}^M)} \sum_{i=1}^{M} \mathrm{score}(S^i = c^i). \tag{2}$$

The filter function used considers the following constraints:

1.  Arguments cannot cover the predicate except those that contain
only the verb or the verb and the following word.

2.  Arguments cannot overlap with the clauses (they can be embedded in
one another).

3.  If a predicate is outside a clause, its arguments cannot be
embedded in that clause.

4.  No overlapping or embedding arguments.

5.  No duplicate argument classes for A0-A5, V.

6.  Exactly one V argument per verb.

7.  If there is C-V, then there should be a sequence of consecutive V,
A1, and C-V pattern. For example, when split is the verb in “split it
up”, the A1 argument is “it” and C-V argument is “up”.

8.  If there is an R-XXX argument, then there has to be an XXX
argument. That is, if an argument is a reference to some other
argument XXX, then this referenced argument must exist in the
sentence.

9.  If there is a C-XXX argument, then there has to be an XXX
argument; in addition, the C-XXX argument must occur after XXX. This
is stricter than the previous rule because the order of appearance
also needs to be considered.

10.  Given the predicate, some argument classes are illegal (e.g.,
predicate 'stalk' can take only A0 or A1). This linguistic information
can be found in PropBank Frames.

We reformulate the constraints as linear (in)equalities by introducing
indicator variables. The optimization problem (Eq. 2) is solved using
ILP.

4.2  Using  Integer  Linear  Programming 

As discussed previously, a collection of potential arguments is not
necessarily a valid semantic labeling since it must satisfy all of the
constraints. In this context, inference is the process of finding the
best (according to Equation 1) valid semantic labels that satisfy all
of the specified constraints. We take an approach similar to the one
previously used for entity/relation recognition (Roth and Yih, 2004),
and model this inference procedure as solving an ILP.

An integer linear program (ILP) is basically the same as a linear
program. The cost function and the (in)equality constraints are all
linear in terms of the variables. The only difference in an ILP is
that the variables can only take integers as their values. In our
inference problem, the variables are in fact binary. A general binary
integer programming problem can be stated as follows.

We are given a cost vector $\mathbf{p} \in \mathbb{R}^d$, a set of
variables $\mathbf{z} = (z_1, \dots, z_d)$, and cost matrices
$C_1 \in \mathbb{R}^{t_1} \times \mathbb{R}^d$,
$C_2 \in \mathbb{R}^{t_2} \times \mathbb{R}^d$, where $t_1$ and $t_2$
are the numbers of inequality and equality constraints and $d$ is the
number of binary variables. The ILP solution $\mathbf{z}^*$ is the
vector that maximizes the cost function,

$$\mathbf{z}^* = \operatorname*{argmax}_{\mathbf{z} \in \{0,1\}^d} \mathbf{p} \cdot \mathbf{z}, \quad \text{subject to } C_1 \mathbf{z} \ge \mathbf{b}_1 \text{ and } C_2 \mathbf{z} = \mathbf{b}_2,$$

where $\mathbf{b}_1 \in \mathbb{R}^{t_1}$,
$\mathbf{b}_2 \in \mathbb{R}^{t_2}$, and for all $z \in \mathbf{z}$,
$z \in \{0, 1\}$.

To solve the problem of Equation 2 in this setting, we first
reformulate the original cost function
$\sum_{i=1}^{M} \mathrm{score}(S^i = c^i)$ as a linear function over
several binary variables, and then represent the filter function
$\mathcal{F}$ using linear inequalities and equalities.

We set up a bijection from the semantic labeling to the variable set
$\mathbf{z}$. This is done by setting $\mathbf{z}$ to a set of
indicator variables. Specifically, let $z_{ic} = [S^i = c]$ be the
indicator variable that represents whether or not the argument type
$c$ is assigned to $S^i$, and let $p_{ic} = \mathrm{score}(S^i = c)$.
Equation 1 can then be written as an ILP cost function as

$$\operatorname*{argmax}_{\mathbf{z} \in \{0,1\}^d} \sum_{i=1}^{M} \sum_{c=1}^{|\mathcal{P}|} p_{ic} z_{ic},$$

subject to

$$\sum_{c=1}^{|\mathcal{P}|} z_{ic} = 1 \quad \forall i \in \{1, \dots, M\},$$

which means that each argument can take only one type. Note that this
new constraint comes from the variable transformation, and is not one
of the constraints used in the filter function $\mathcal{F}$.
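To make the variable transformation concrete, the sketch below encodes a labeling as 0/1 indicators and evaluates the linear objective. The label set, the scores, and the helper names (`to_indicators`, `objective`, `one_label_each`) are made up for illustration. It also shows that, when the one-label-per-argument constraint is the only one in force, the maximizer is simply the best label for each argument taken independently.

```python
LABELS = ["A0", "A1", "null"]  # hypothetical label set
SCORES = [                     # SCORES[i][c] plays the role of p_ic (made-up values)
    {"A0": 0.7, "A1": 0.2, "null": 0.1},
    {"A0": 0.5, "A1": 0.4, "null": 0.1},
]

def to_indicators(labeling):
    """z[(i, c)] is 1 if argument i gets label c, else 0."""
    return {(i, c): int(labeling[i] == c)
            for i in range(len(labeling)) for c in LABELS}

def objective(z):
    """Linear objective: sum over i and c of p_ic * z_ic."""
    return sum(SCORES[i][c] * z_ic for (i, c), z_ic in z.items())

def one_label_each(z, m):
    """Check that the indicators of every argument sum to exactly 1."""
    return all(sum(z[(i, c)] for c in LABELS) == 1 for i in range(m))

# With no further constraints the objective decomposes per argument.
best = [max(LABELS, key=lambda c: s[c]) for s in SCORES]
z = to_indicators(best)
```

Here both arguments receive A0, which is exactly the kind of labeling that the filter function's constraints are meant to rule out.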

Constraints 1 through 3 can be evaluated on a per-argument basis. For
the sake of efficiency, arguments that violate these constraints are
eliminated even before being given to the second-phase classifier.
Next, we show how to transform the constraints in the filter function
into the form of linear (in)equalities over $\mathbf{z}$, and use them
in this ILP setting.


Constraint 4: No overlapping or embedding. If arguments
$S^{j_1}, \dots, S^{j_k}$ occupy the same word in a sentence, then
this constraint restricts only one argument to be assigned to an
argument type. In other words, $k - 1$ arguments will be the special
class null, which means the argument candidate is not a legitimate
argument. If the special class null is represented by the symbol
$\phi$, then for every set of such arguments, the following linear
equality represents this constraint.

$$\sum_{i=1}^{k} z_{j_i \phi} = k - 1$$

Constraint 5: No duplicate argument classes. Within the same sentence,
several types of arguments cannot appear more than once. For example,
a predicate can only take one A0. This constraint can be represented
using the following inequality.

$$\sum_{i=1}^{M} z_{i\,\mathrm{A0}} \le 1$$

Constraint 6: Exactly one V argument. For each verb, there is one and
has to be one V argument, which represents the active verb. Similarly,
this constraint can be represented by the following equality.

$$\sum_{i=1}^{M} z_{i\,\mathrm{V}} = 1$$

Constraint 7: V-A1-C-V pattern. This constraint is only useful when
there are three consecutive candidate arguments in a sentence. Suppose
arguments $S^{j_1}, S^{j_2}, S^{j_3}$ are consecutive. If $S^{j_3}$ is
C-V, then $S^{j_1}$ and $S^{j_2}$ have to be V and A1, respectively.
This if-then constraint can be represented by the following two linear
inequalities.

$$z_{j_3\,\text{C-V}} \le z_{j_1\,\mathrm{V}}, \quad \text{and} \quad z_{j_3\,\text{C-V}} \le z_{j_2\,\mathrm{A1}}$$

Constraint 8: R-XXX arguments. Suppose the referenced argument type is
A0 and the reference type is R-A0. The linear inequalities that
represent this constraint are:

$$\forall m \in \{1, \dots, M\}: \quad \sum_{i=1}^{M} z_{i\,\mathrm{A0}} \ge z_{m\,\text{R-A0}}$$

If there are $\gamma$ reference argument pairs, then the total number
of inequalities needed is $\gamma M$.


Constraint 9: C-XXX arguments. This constraint is similar to the
reference argument constraints. The difference is that the continued
argument XXX has to occur before C-XXX. Assume that the argument pair
is A0 and C-A0, and argument $S^j$ appears before $S^k$ if $j < k$.
The linear inequalities that represent this constraint are:

$$\forall m \in \{2, \dots, M\}: \quad \sum_{i=1}^{m-1} z_{i\,\mathrm{A0}} \ge z_{m\,\text{C-A0}}$$

Constraint 10: Illegal argument types. Given a specific verb, some
argument types should never occur. For example, most verbs do not take
an A5 argument. This constraint is represented by summing all the
corresponding indicator variables to be 0.

$$\sum_{i=1}^{M} z_{i\,\mathrm{A5}} = 0$$
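The following sketch shows how a few of these constraints behave when checked as linear (in)equalities over the indicator variables. It is an illustration under stated assumptions rather than the system's implementation: exhaustive enumeration stands in for an ILP solver (it is exponential in the number of candidates and only usable on toy inputs), the spans and scores are invented, and only Constraints 4, 5, and 8 are included. Constraint 4 is written here as an inequality (at least k - 1 of the candidates sharing a word are null), which is an assumption of this sketch so that a word covered by a single candidate does not force that candidate to be labeled.

```python
from itertools import product

LABELS = ["A0", "A1", "R-A0", "null"]
SPANS = [(0, 1), (1, 2), (3, 3)]  # hypothetical candidates; 0 and 1 share word 1
SCORES = [                        # made-up second-phase scores p_ic
    {"A0": 0.60, "A1": 0.30, "R-A0": 0.00, "null": 0.10},
    {"A0": 0.50, "A1": 0.20, "R-A0": 0.10, "null": 0.20},
    {"A0": 0.45, "A1": 0.40, "R-A0": 0.05, "null": 0.10},
]
M = len(SPANS)

def indicators(labeling):
    """z[(i, c)] = 1 if candidate i is assigned label c, else 0."""
    return {(i, c): int(labeling[i] == c) for i in range(M) for c in LABELS}

def feasible(z):
    # Constraint 4 (inequality form): among candidates covering a word,
    # at least k - 1 must be null.
    words = {w for start, end in SPANS for w in range(start, end + 1)}
    for w in words:
        cover = [i for i, (start, end) in enumerate(SPANS) if start <= w <= end]
        if sum(z[(i, "null")] for i in cover) < len(cover) - 1:
            return False
    # Constraint 5: each core label is used at most once.
    for c in ("A0", "A1"):
        if sum(z[(i, c)] for i in range(M)) > 1:
            return False
    # Constraint 8: an R-A0 anywhere requires some A0.
    total_a0 = sum(z[(i, "A0")] for i in range(M))
    return all(total_a0 >= z[(m, "R-A0")] for m in range(M))

def solve():
    """Return the highest-scoring labeling that satisfies the constraints."""
    best, best_value = None, float("-inf")
    for labeling in product(LABELS, repeat=M):
        z = indicators(labeling)
        if not feasible(z):
            continue
        value = sum(SCORES[i][c] * z[(i, c)] for i in range(M) for c in LABELS)
        if value > best_value:
            best, best_value = list(labeling), value
    return best, best_value

best, value = solve()
```

On this toy input the unconstrained per-argument choice would label all three candidates A0; the constrained search instead keeps the first candidate as A0, sets the overlapping second candidate to null, and labels the third A1.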

Using ILP to solve this inference problem enjoys several advantages.
Linear constraints are very general, and are able to represent many
types of constraints. Previous approaches usually rely on dynamic
programming to resolve non-overlapping/embedding constraints (i.e.,
Constraint 4) when the data is sequential, but are unable to handle
other constraints. The ILP approach is flexible enough to handle
constraints regardless of the structure of the data. Although solving
an ILP problem is NP-hard, with the help of today's commercial
numerical packages, this problem can usually be solved very fast in
practice. For instance, it only takes about 10 minutes to solve the
inference problem for 4305 sentences on a Pentium-III 800 MHz machine
in our experiments. Note that ordinary search methods (e.g., beam
search) are not necessarily faster than solving an ILP problem and do
not guarantee the optimal solution.

5  Experimental  Results 

The system is evaluated on the data provided in the CoNLL-2004
semantic-role labeling shared task, which consists of a portion of the
PropBank corpus. The training set is extracted from TreeBank (Marcus
et al., 1993) sections 15-18, the development set, used in tuning
parameters of the system, from section 20, and the test set from
section 21.

We first compare this system with the basic tagger that we have, the
CSCL shallow parser from (Punyakanok and Roth, 2001), which is
equivalent to using the scoring function from the first phase with
only the non-overlapping/embedding constraints. In addition, we
evaluate the effectiveness of using only this constraint versus all
constraints, as in Sec. 4.

                            Prec.    Rec.     Fβ=1
1st-phase, non-overlap      70.54    61.50    65.71
1st-phase, All Const.       70.97    60.74    65.46
2nd-phase, non-overlap      69.69    64.75    67.13
2nd-phase, All Const.       71.96    64.93    68.26

Table 1: Summary of experiments on the development set. All results
are for overall performance.

                            Precision    Recall    Fβ=1
Without Inference           86.95        87.24     87.10
With Inference              88.03        88.23     88.13

Table 2: Results of second phase phrase prediction and inference
assuming perfect boundary detection in the first phase. Inference
improves performance by restricting label sequences rather than
restricting structural properties since the correct boundaries are
given. All results are for overall performance on the development set.

Table 1 shows how additional constraints over the standard
non-overlapping constraints improve performance on the development
set. The argument scoring is chosen from either the first phase or the
second phase and each is evaluated by considering simply the
non-overlapping/embedding constraint or the full set of linguistic
constraints. To make a fair comparison, parameters were set separately
to optimize performance when using the first phase results. In
general, using all constraints increases $F_{\beta=1}$ by about 1% in
this system, but slightly decreases the performance when only the
first phase classifier is used. Also, using the two-phase architecture
improves both precision and recall, and the enhancement reflected in
$F_{\beta=1}$ is about 2.5%.
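The reported F values can be checked against the precision and recall figures, assuming the usual definition of the F measure with β = 1 as the harmonic mean of precision and recall (the text does not spell the formula out). The helper name `f_beta1` is hypothetical; the input numbers are the development-set figures reported in Table 1.

```python
def f_beta1(precision, recall):
    """Harmonic mean of precision and recall (F measure with beta = 1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs reported for the development set in Table 1.
first_phase_non_overlap = f_beta1(70.54, 61.50)
second_phase_all_const = f_beta1(71.96, 64.93)

# Gain of the full two-phase system over the first-phase baseline.
gain = second_phase_all_const - first_phase_non_overlap
```

The recomputed values agree with the reported 65.71 and 68.26 to two decimal places, and their difference matches the roughly 2.5 point improvement attributed to the two-phase architecture.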

It is interesting to find out how well the second phase classifier can
perform given perfectly segmented arguments. This evaluates the
quality of the argument classifier, and also provides a conceptual
upper bound. Table 2 first shows the results without using inference
(i.e., $\mathcal{F}(\mathcal{P}^M) = \mathcal{P}^M$). The second row
shows that adding inference to the phrase classification can further
improve $F_{\beta=1}$ by 1%.

Finally, the overall result on the official test set is given in
Table 3. Note that the result here is not comparable with the best in
this domain (Pradhan et al., 2004), where the full parse tree is
assumed given. For a fair comparison, our system was among the best at
CoNLL-04, where the best system (Hacioglu et al., 2004) achieved an F1
score of 69.49.

             Dist.     Prec.     Rec.     Fβ=1
Overall      100.00    70.07     63.07    66.39
A0           26.87     81.13     77.70    79.38
A1           35.73     74.21     63.02    68.16
A2           7.44      54.16     41.04    46.69
A3           1.56      47.06     26.67    34.04
A4           0.52      71.43     60.00    65.22
AM-ADV       3.20      39.36     36.16    37.69
AM-CAU       0.51      45.95     34.69    39.53
AM-DIR       0.52      42.50     34.00    37.78
AM-DIS       2.22      52.00     67.14    58.61
AM-EXT       0.15      46.67     50.00    48.28
AM-LOC       2.38      33.47     34.65    34.05
AM-MNR       2.66      45.19     36.86    40.60
AM-MOD       3.51      92.49     94.96    93.70
AM-NEG       1.32      85.92     96.06    90.71
AM-PNC       0.89      32.79     23.53    27.40
AM-TMP       7.78      59.77     56.89    58.30
R-A0         1.66      81.33     76.73    78.96
R-A1         0.73      58.82     57.14    57.97
R-A2         0.09      100.00    22.22    36.36
R-AM-TMP     0.15      54.55     42.86    48.00

Table 3: Results on the test set.

6  Conclusion

We show that linguistic information is useful for semantic role
labeling, both in extracting features and deriving hard constraints on
the output. We also demonstrate that it is possible to use integer
linear programming to perform inference that incorporates a wide
variety of hard constraints, which would be difficult to incorporate
using existing methods. In addition, we provide further evidence
supporting the use of scoring arguments over scoring argument
boundaries for complex tasks. In the future, we plan to use the full
PropBank corpus to see the improvement when more training data is
provided. In addition, we would like to explore the possibility of an
integer linear programming approach using soft constraints. As more
constraints are considered, we expect the overall performance to
improve.